
fix(groupby): group by non-dimension coordinate names; fast multi-key grouping by names (#750, #753)#751

Open
FBumann wants to merge 9 commits into master from fix/groupby-coord-name

Conversation

@FBumann
Collaborator

@FBumann FBumann commented Jun 4, 2026

What

Brings LinearExpression.groupby to parity with xarray.Dataset.groupby for grouping by coordinate names — on the fast path, with no breaking change. Two related pieces:

1. Group by an attached non-dimension coordinate (closes #750).

expr = (1 * x).assign_coords(period=period)   # 'period' rides on the 'snapshot' dim
expr.groupby("period").sum()   # before: ValueError: period already exists as coordinate
expr.groupby(period).sum()     # before: KeyError: 'period'

Both now work and take the fast path, mirroring xarray. (Grouping by a dimension name always worked — only attached non-dim coordinates were broken.) The old workaround was to detach the coord first (expr.drop_vars("period").groupby(period).sum()).
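For reference, the xarray behaviour that this change mirrors can be sketched as below. This is plain xarray, not linopy code; the variable name `value` and the sample data are illustrative assumptions.

```python
import numpy as np
import xarray as xr

# A dataset with a 'snapshot' dimension and a non-dimension coordinate
# 'period' attached to it (illustrative data).
ds = xr.Dataset(
    {"value": ("snapshot", np.array([1.0, 2.0, 3.0, 4.0]))},
    coords={
        "snapshot": np.arange(4),
        "period": ("snapshot", np.array([2020, 2020, 2030, 2030])),
    },
)

# Group by the coordinate's name ...
by_name = ds.groupby("period").sum()

# ... or by the coordinate DataArray itself.
by_array = ds.groupby(ds["period"]).sum()

print(by_name["value"].values)   # sums per period
print(by_array["value"].values)  # same result
```

In both cases the `snapshot` dimension is replaced by a `period` dimension holding one entry per distinct period.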

2. Multi-key grouping by names now takes the fast path (closes #753).

expr.groupby(["period", "season"]).sum()   # before: silently dropped to the slow xarray fallback

It now rides the reindex fast path and returns the same output — one dimension per key, byte-identical to the fallback, sparse fill cells included. The pd.DataFrame grouper is untouched and keeps its compact stacked-MultiIndex output, so this is non-breaking.
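The difference between the two output layouts can be illustrated with a small standard-library sketch. It is not linopy's implementation: the data, the `FILL` placeholder, and the dict-based representation are assumptions chosen only to show "one dimension per key" next to "observed combinations only".

```python
from collections import defaultdict
from itertools import product

# Hypothetical per-snapshot data: one value plus two grouping keys.
values = [1.0, 2.0, 3.0, 4.0]
period = [2020, 2020, 2030, 2030]
season = ["winter", "winter", "summer", "summer"]

# Compact layout: one entry per observed (period, season) combination,
# analogous to a stacked MultiIndex.
compact = defaultdict(float)
for v, p, s in zip(values, period, season):
    compact[(p, s)] += v

# Dense layout: one axis per key, so every combination gets a cell and
# unobserved combinations hold a fill value.
FILL = None  # stand-in for a fill value (assumption of this sketch)
period_axis = sorted(set(period))
season_axis = sorted(set(season))
dense = {
    (p, s): compact.get((p, s), FILL)
    for p, s in product(period_axis, season_axis)
}

print(len(compact), "observed cells;", len(dense), "dense cells")
```

Here only two of the four dense cells carry data; the other two are fill, which is the effect the Memory section below quantifies.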

How

In LinearExpressionGroupby:

  • _resolve_group normalizes a key: it unwraps a single-element list (groupby(["period"]) → scalar) and resolves a string coord name to its coordinate so it takes the fast path.
  • sum() drops every coordinate aligned to the grouped dimension before reshaping, so an attached aux coord (including the one being grouped by) no longer collides on the final rename.
  • The groupby property detaches a free (non-indexed) coordinate before handing it to xarray — fixing the use_fallback=True path — but never a MultiIndex level.
  • A list of coord names (1-D, sharing one dim) is gathered into a value frame to ride the fast path, then unstacked back into one dimension per key.
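The key normalization described in the list above might look roughly like the following. This is a hypothetical sketch, not the actual `_resolve_group`: the function name `resolve_group`, the plain-dict `coords` argument, and the dict returned for multiple keys are all assumptions made for illustration.

```python
def resolve_group(group, coords):
    """Hypothetical sketch: normalise a groupby key against named coordinates."""
    # A single-element list or tuple behaves like its only element.
    if isinstance(group, (list, tuple)) and len(group) == 1:
        group = group[0]

    # A string naming a known coordinate resolves to that coordinate's values.
    if isinstance(group, str) and group in coords:
        return coords[group]

    # A list of known coordinate names resolves to one column per key,
    # playing the role of the "value frame" mentioned above.
    if isinstance(group, (list, tuple)) and all(
        isinstance(g, str) and g in coords for g in group
    ):
        return {name: coords[name] for name in group}

    # Anything else is passed through unchanged.
    return group


coords = {"period": [2020, 2020, 2030], "season": ["w", "w", "s"]}
print(resolve_group(["period"], coords))            # same as resolve_group("period", coords)
print(resolve_group(["period", "season"], coords))  # one column per key
```

Keeping this logic in one helper means the `sum()` fast path and the fallback path see the same normalized key.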

Memory

One dimension per key is a dense cartesian grid, so a sparse key crossing materialises mostly-fill cells (measured at ~100× relative to the compact DataFrame grouper for a diagonal crossing; see #740). A UserWarning nudges sparse/high-cardinality users to the DataFrame grouper. The check reads the collapsed MultiIndex levels, so it is O(observed), not O(N), and fires before unstack allocates the grid. Getting both separate dims and compact storage would need a sparse/long-format kernel, tracked in #757 (the groupby-densification follow-up) under the sparse-kernel umbrella #756.
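The arithmetic behind the blow-up can be sketched with the standard library. The helper names and the `threshold` value are hypothetical, and unlike the check described above, this sketch scans every row rather than reading collapsed MultiIndex levels.

```python
import warnings
from math import prod


def dense_grid_ratio(*keys):
    """Ratio of dense grid cells to observed key combinations."""
    observed = set(zip(*keys))
    if not observed:
        return 1.0
    # One axis per key: the grid holds the product of the distinct counts.
    grid = prod(len(set(k)) for k in keys)
    return grid / len(observed)


def warn_if_sparse(*keys, threshold=10.0):
    """Warn when the dense grid is much larger than the observed combinations."""
    ratio = dense_grid_ratio(*keys)
    if ratio > threshold:  # threshold is an assumed value for this sketch
        warnings.warn(
            f"dense grid is {ratio:.0f}x the observed combinations; "
            "consider a compact (stacked) grouping instead",
            UserWarning,
            stacklevel=2,
        )
    return ratio


# Diagonal crossing: each value of the first key pairs with exactly one
# value of the second, so 100 observed combinations fill a 100 x 100 grid.
n = 100
diagonal = list(range(n))
ratio = dense_grid_ratio(diagonal, diagonal)
print(ratio)
```

A fully crossed pair of keys gives a ratio of 1, so the warning stays silent in the dense case.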

Tests

TestGroupbyByAttachedCoordinate asserts grouped vars/coeffs against hard-coded results on a deterministic model: single-key (name/DataArray × use_fallback), 2-D variable, dimension-coordinate-by-name, single-element list, MultiIndex level, and a pytest.raises row pinning that a list of DataArrays is unsupported. TestMultiKeyFastPath covers the multi-key fast path: fast == fallback (list/tuple, including a sparse crossing), separate-dims-not-stacked, sparse-combination-filled, DataFrame-grouper-stays-compact, and the blow-up warning (fires when sparse, silent when dense). Full suite green; ruff + mypy clean.

Context: relation to PyPSA usage and #744

This is a building block for the flat-dimension + auxiliary-level-coord direction discussed in #744. If n.snapshots becomes a flat dimension carrying period / scenario level coordinates (instead of a stacked pd.MultiIndex), aggregating an expression over a level becomes expr.groupby("period").sum() — exactly what this PR makes work, now also for multiple levels at once (groupby(["period", "scenario"])).

PyPSA still carries per-period workarounds citing a broken MultiIndex groupby (pydata/xarray#6836); that upstream bug is now fixed (verified on xarray 2025.9.0), but those comments are about MultiIndex grouping and per-period rolling, orthogonal to this PR.

Closes #750, #753. Drafted with an agent (Claude Code).

🤖 Generated with Claude Code

FBumann and others added 4 commits June 4, 2026 13:40
`LinearExpression.groupby` could not group by an attached non-dimension
coordinate. `expr.groupby("period").sum()` raised `ValueError: period
already exists as coordinate or variable name`, and passing the coordinate
`DataArray` (`groupby(period)`) raised because the fast path dropped only
the dimension index, then renamed the group dim onto a name still held by
the attached coordinate.

Fix both paths:

- `sum()` resolves a string group naming an existing coordinate to that
  coordinate so it takes the fast path, and drops every coordinate aligned
  to the grouped dimension (index, MultiIndex levels, auxiliary coords)
  before reshaping, since collapsing the dimension invalidates them all.
- The `groupby` property detaches an attached non-dimension coordinate used
  as the group before handing it to xarray, so xarray does not try to
  re-expand it when recombining groups (the `use_fallback=True` path).

`expr.groupby("period").sum()` now mirrors `xarray.Dataset.groupby`.

Closes #750

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Reorganize the #750 coverage into TestGroupbyByAttachedCoordinate, a
parametrized matrix that asserts the grouped expression against hard-coded
`vars`/`coeffs` literals (on a deterministic 4- and 8-variable model)
instead of comparing to a sibling computation that could share the same
bug. Covers single-key (name / DataArray x use_fallback), multi-key
(list / tuple x use_fallback), an extra auxiliary coord on the grouped
dimension, and a 2-D variable that must keep its other dimension.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tract

A single-element key list (`groupby(["period"])`) now groups like the
scalar key, matching xarray -- it is unwrapped in both `sum()` and the
`groupby` property. Multi-key grouping must be spelled with names
(`["period", "season"]`); a list of `DataArray`s is unhashable and raises
in xarray itself, so linopy mirrors that (covered by an explicit
`pytest.raises` row in the matrix).

Also shorten the test class docstring to house style.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@FBumann FBumann marked this pull request as ready for review June 4, 2026 12:13
FBumann and others added 4 commits June 4, 2026 14:23
The GH #750 detach must only drop *free* (non-indexed) coordinates. The
earlier change also dropped a MultiIndex level when grouping by it via
`use_fallback=True`, leaving the dimension without an index
(`('snapshot',) are not coordinates with an index`). Guard the detach with
`group.name not in data.xindexes` so MultiIndex levels are left intact.

Grouping by a MultiIndex level now works on both paths (the
pydata/xarray#6836 case, fixed upstream). Add a parametrized regression
test over both levels and both paths.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Extract the single-element-list unwrap + coordinate-name resolution shared
  by `sum()` and the `groupby` property into one `_resolve_group` helper,
  removing the duplication (and the drift between the two that caused the
  earlier MultiIndex-level regression).
- Drop the now-stale `(GH #750)` references from code comments; the link
  lives in the release notes.
- Add a test for grouping by a dimension coordinate name (the fast-path
  broadening), and note it in the release note.
- Simplify `test_multi_key`: a multi-key group always uses the xarray
  fallback, so drop the redundant `use_fallback` parametrization.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Grouping by a dimension coordinate or a MultiIndex level by name already
worked; only the non-dimension (free) coordinate case was broken. Correct
the release note, which had overclaimed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Drop the verbose comment blocks; the rationale lives in the _resolve_group
docstring and the regression tests enforce the invariants (e.g.
test_multiindex_level guards the xindexes detach guard).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@FBumann FBumann changed the title fix(groupby): group by the name of a non-dimension coordinate (#750) fix(groupby): group by non-dimension coordinate names; fast & flat multi-key grouping (#750, #753) Jun 4, 2026
@FBumann FBumann force-pushed the fix/groupby-coord-name branch from 63e82ef to ae23a3c Compare June 4, 2026 14:48
@FBumann FBumann changed the base branch from master to feat/groupby-parity June 4, 2026 14:48
@FBumann FBumann changed the title fix(groupby): group by non-dimension coordinate names; fast & flat multi-key grouping (#750, #753) fix(groupby): group by the name of a non-dimension coordinate (#750) Jun 4, 2026
@FBumann FBumann changed the base branch from feat/groupby-parity to master June 4, 2026 14:55
@FBumann FBumann changed the title fix(groupby): group by the name of a non-dimension coordinate (#750) fix(groupby): group by non-dimension coordinate names; fast & flat multi-key grouping (#750, #753) Jun 4, 2026
@FBumann FBumann force-pushed the fix/groupby-coord-name branch from 63e82ef to de067dc Compare June 4, 2026 15:20
@FBumann FBumann changed the title fix(groupby): group by non-dimension coordinate names; fast & flat multi-key grouping (#750, #753) fix(groupby): group by non-dimension coordinate names; fast multi-key grouping by names (#750, #753) Jun 4, 2026
`groupby(["a","b"]).sum()` previously dropped to the slow xarray fallback.
Resolve a list of coordinate names (1-D, same dim) to a value frame so it
rides the existing reindex fast path, then unstack the stacked result back
into one dimension per key -- byte-identical to the fallback, sparse fill
cells included. The DataFrame grouper is untouched and stays compact (stacked
MultiIndex over observed combinations only), so this is non-breaking.

One dimension per key is a dense cartesian grid, so a sparse key crossing
materialises mostly-fill cells. Warn (pointing at the DataFrame grouper) when
the grid is much larger than the observed combinations; the check reads the
collapsed MultiIndex levels, so it is O(observed) and fires before unstack
allocates.

See #753; sparse-representation follow-ups tracked against #740.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
